A multi-source molecular network representation model for protein–protein interactions prediction

The prediction of potential protein–protein interactions (PPIs) is a critical step in decoding diseases and understanding cellular mechanisms. Traditional biological experiments have identified plenty of potential PPIs in recent years, but this problem is still far from being solved. Hence, there is an urgent need to develop computational models with good performance and high efficiency to predict potential PPIs. In this study, we propose a multi-source molecular network representation learning model (called MultiPPIs) to predict potential protein–protein interactions. Specifically, we first extract the protein sequence features according to the physicochemical properties of amino acids by utilizing the auto covariance method. Second, a multi-source association network is constructed by integrating the known associations among miRNAs, proteins, lncRNAs, drugs, and diseases. The graph representation learning method, DeepWalk, is adopted to extract the multi-source association information of proteins with other biomolecules. In this way, the known protein–protein interaction pairs can be represented as a concatenation of the protein sequence features and the multi-source association features of proteins. Finally, the Random Forest classifier with its corresponding optimal parameters is used for training and prediction. MultiPPIs achieves an average prediction accuracy of 86.03% and a sensitivity of 82.69%, with an AUC of 93.03%, under five-fold cross-validation. The experimental results indicate that MultiPPIs has good prediction performance and provides valuable insights into the prediction of potential protein–protein interactions. MultiPPIs is freely available at https://github.com/jiboyalab/multiPPIs.

a graph representation method, we extract association information from this network. We then utilize 19,237 known PPI pairs from the STRING database (2017) 30 as our positive dataset. A matching number of random non-interacting pairs form the negative dataset. These datasets are combined to create our final training set. The prediction model is constructed using a Random Forest (RF) classifier, optimized for best performance. The process flow of MultiPPIs is outlined in Fig. 2. In our study, the proposed model, under fivefold cross-validation, achieves an average accuracy of 0.8603 and an AUC of 0.9304. These results are better than those of many current computational methods. We also compared two feature combination strategies. By combining multiple types of data, our method is more effective than using only protein sequence information. Additionally, we test four popular classifiers and find the Random Forest classifier to be the most suitable for our model, offering superior prediction performance. These experiments demonstrate that our model is an efficient tool for predicting potential protein-protein interactions.
Compared with previous computational methods [8][9][10][11][12], our method mainly has the following specific advantages: (1) Considering the holistic nature of biomolecular networks, our method collects a large amount of association data to construct a multi-source molecular network, and extracts the higher-order network features of proteins based on the graph representation learning method to improve the accuracy of the prediction of PPIs. (2) Our method fully takes advantage of the local property of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties. This not only improves the prediction performance of our method, but also addresses the cold-start problem often encountered by graph neural network-based methods. (3) By conducting extensive experiments, including comparison of feature combinations, comparison of classification models, optimization and adjustment of model parameters, and comparison with previous experimental methods, our method has been confirmed to have excellent performance in predicting PPIs and is better than most previous computational methods.
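The construction of the negative dataset can be illustrated with a minimal sketch. The text only states that a matching number of random non-interacting pairs is used; the sampling procedure below (uniform sampling of unordered pairs, rejecting known interactions) and the function name `sample_negative_pairs` are assumptions made for illustration.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, seed=0):
    """Draw as many random non-interacting pairs as there are positive pairs.

    Assumption: a pair is treated as non-interacting when it is absent from
    the known positive set; the paper does not describe its exact procedure.
    """
    rng = random.Random(seed)
    # Store pairs order-independently so (a, b) and (b, a) are the same pair.
    known = {frozenset(p) for p in positive_pairs}
    target = len(known)
    total_pairs = len(proteins) * (len(proteins) - 1) // 2
    if target > total_pairs - len(known):
        raise ValueError("not enough non-interacting pairs to sample from")

    negatives = set()
    while len(negatives) < target:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)
```

The positive and negative pairs would then be labeled 1 and 0, respectively, and merged into a single training set.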

Results and discussion
The five-fold cross-validation performance of our proposed model
Cross-validation is a standard method used in machine learning to construct and validate model parameters. In this work, fivefold cross-validation was adopted to evaluate the performance of our model. First, we divided the sample data equally into five parts. Second, we sequentially selected four parts as the training set and the remaining part as the test set. The experiment was repeated five times. Finally, six standard parameters were used as evaluation indicators for our experiments, including specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC). Table 1 lists the detailed results of each validation. The last row shows the average value and the standard deviation of the results across the five runs of the classifier. These experimental results demonstrated that our model could achieve good results and stability in protein-protein interaction prediction.
The Receiver Operating Characteristic (ROC) curve is a common statistical analysis tool widely used to judge the quality of classification and prediction results in medical research and machine learning. It first sorts the samples according to the prediction scores of the classifier and then predicts the samples as positive one by one in this order. Each step yields two values, the False Positive Rate and the True Positive Rate, which are plotted as the horizontal and vertical coordinates, respectively. The AUC is defined as the area under the ROC curve, and its value generally lies between 0.5 and 1. Because the ROC curve alone often cannot show clearly which classifier performs better, the AUC value is selected as the evaluation index. The classifier with a larger AUC has better performance. The Precision-Recall (PR) curve is another tool to evaluate the performance of a classifier. For class-imbalanced problems, the PR curve is widely considered superior to the ROC curve. Similarly, the AUPR is defined as the area under the PR curve. Figures 3 and 4 show the ROC and PR curves of our method under fivefold cross-validation, respectively. These results once again demonstrated our model's good effect and stability in predicting potential protein-protein interactions.
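The fold splitting and the ROC construction described above can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used in the study: the function names, the shuffling seed, and the trapezoidal integration are assumptions.

```python
import random


def five_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) so that each fold is the test set once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def roc_auc(labels, scores):
    """Area under the ROC curve (trapezoidal rule).

    Samples are sorted by decreasing score and declared positive one group
    at a time; tied scores are handled together as a single ROC point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required to compute the AUC")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr, tpr = fp / neg, tp / pos  # x and y coordinates of the ROC point
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc
```

In each fold, the classifier would be trained on `train_idx`, scored on `test_idx`, and the resulting AUC values averaged over the five folds.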

Compare the effect of our feature combination strategy
To further compare the effect of our feature combination strategy, a different feature combination was utilized to represent protein nodes. More specifically, we used only the protein sequence features (combination 1) and the combination of the protein sequence features and the multi-source associated information of proteins used by MultiPPIs (combination 2) to represent proteins before carrying out the fivefold cross-validation experiment. The experimental environment of the two combinations was the same to ensure a fair comparison. Table 2 lists the results of combination 1 under the fivefold cross-validation experiment. The results of combination 2 are shown in Table 1. Figures 5 and 6 show the ROC and PR curves of the comparative experiment, respectively.
As the results show, our feature combination strategy performs better than using only protein sequence features. This once again shows that the multi-source association information of proteins with other biomolecules is helpful for protein-protein interaction prediction.

Compare the effect of different classifiers
To choose the most suitable classifier for our model, we conducted a comparison experiment with the four most commonly used classifiers, including Decision Tree, Naive Bayes, KNN, and Random Forest. We used these four classifiers with default training parameters to train and predict the protein-protein interactions and kept other experimental conditions consistent. Finally, the Random Forest classifier performed better by observing the

Compare the effect of random forest classifier parameter
Random Forest (RF) is a flexible and efficient supervised learning algorithm proposed by Breiman in 2001. This algorithm has achieved good results in many fields. It resists overfitting, is stable, and handles nonlinear problems well. It is a particular bootstrap aggregating (bagging) method that uses decision trees as the base models. It first uses the bootstrap method to generate training sets and then constructs a decision tree for each training set. Finally, all these decision trees are combined to form the classifier to improve the overall effect. Additionally, when splitting a node, the Random Forest method does not search all features for the one that maximizes the index (such as information gain). Instead, it randomly extracts a subset of features and then finds the optimal split within this subset. For the Random Forest model parameters, we need to set the number of trees N. In detail, we started training the model at N = 180, increased N at intervals of 20, and observed the relationship between N and the final prediction accuracy. We terminated the model training once the prediction accuracy decreased as N increased.
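The parameter search described above amounts to a simple early-stopping loop. The sketch below assumes a user-supplied callable `evaluate(n)` that returns the cross-validated accuracy for a forest with `n` trees; this callable, the function name `search_tree_count`, and the `max_trees` safety bound are hypothetical and not part of the original method description.

```python
def search_tree_count(evaluate, start=180, step=20, max_trees=1000):
    """Increase the number of trees N from `start` in steps of `step` and
    stop as soon as the accuracy decreases.

    `evaluate` is a hypothetical callable mapping N to an accuracy value;
    `max_trees` is an assumed upper bound that guarantees termination.
    """
    best_n, best_acc = start, evaluate(start)
    history = [(best_n, best_acc)]
    n = start + step
    while n <= max_trees:
        acc = evaluate(n)
        history.append((n, acc))
        if acc < best_acc:  # accuracy dropped: terminate the search
            break
        if acc > best_acc:  # on ties, keep the smaller forest
            best_n, best_acc = n, acc
        n += step
    return best_n, best_acc, history
```

One possible choice for `evaluate` is the mean five-fold accuracy of scikit-learn's `RandomForestClassifier(n_estimators=n)`, although the text does not state which implementation was used.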

Performance comparison with the state-of-the-art methods
To further evaluate the effectiveness of MultiPPIs, we conduct a detailed comparative analysis between it and several existing protein-protein interaction prediction methods, including LR_PPI 31, DPPI 32, WSRC_GE 33, LPPI 34 and PIPR 35. Our evaluation framework encompasses six distinct performance metrics, as detailed in Table 5. These metrics include specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC), providing a comprehensive view of each method's predictive capabilities. Our findings reveal a significant enhancement in performance with MultiPPIs. This substantial leap in accuracy underscores the effectiveness of MultiPPIs in identifying protein-protein interactions, marking a notable advancement in the field.

Protein sequence features based on the physicochemical properties of amino acids
In this study, we downloaded the sequence information of proteins from the STRING database (2017) 30. Proteins are biopolymers composed of up to 20 different amino acids as basic units. The sequence of amino acid residues in the peptide chain is called the primary structure of proteins. Consequently, we selected six physicochemical properties of amino acids to represent the protein sequence features in this work, including polarity (P1), hydrophobicity (H), net charge index of side chains (NCISC), volumes of side chains of amino acids (VSC), solvent-accessible surface area (SASA) and polarizability (P2). The original physicochemical values of these 20 amino acids are listed in Table 6.

Performance evaluation criteria for our experiments
In order to verify the quality of our proposed method, six standard parameters were calculated as evaluation indicators for our experiments, including specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC). The first five are computed as follows:

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Spec.} = \frac{TN}{TN + FP}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TN, FN, TP, and FP represent the total numbers of true negatives, false negatives, true positives, and false positives, respectively. Furthermore, the AUC (the area under the ROC curve) was also used to evaluate the performance of our model.
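The five count-based indicators can be computed directly from the confusion-matrix counts. The following is a minimal sketch using their standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Acc., Sen., Spec., Prec. and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
        "Sen": ratio(tp, tp + fn),    # true positive rate (recall)
        "Spec": ratio(tn, tn + fp),   # true negative rate
        "Prec": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, `classification_metrics(tp=80, tn=90, fp=10, fn=20)` gives an accuracy of 0.85, a sensitivity of 0.80 and a specificity of 0.90.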

Auto covariance (AC) method
The extraction of protein sequence features using the auto covariance (AC) method was proposed by Guo et al. 36. This method fully takes advantage of the local property of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties [37][38][39]. First, we normalized the original physicochemical values of the 20 amino acids to zero mean and unit standard deviation (SD) according to Eq. (1):

$$P'_{ij} = \frac{P_{ij} - P_j}{S_j}$$

where P_ij is the j-th descriptor value for the i-th amino acid, P_j is the mean of the j-th descriptor over the 20 amino acids and S_j is the corresponding standard deviation, given by:

$$S_j = \sqrt{\frac{1}{20}\sum_{i=1}^{20}\left(P_{ij} - P_j\right)^2}$$

In this way, each amino acid in a protein sequence is converted to the corresponding standardized physicochemical values. Then, the AC method is used to encode the protein sequence into a feature vector:

$$AC(d, j) = \frac{1}{N - d}\sum_{i=1}^{N-d}\left(X_{i,j} - \frac{1}{N}\sum_{k=1}^{N} X_{k,j}\right)\left(X_{i+d,j} - \frac{1}{N}\sum_{k=1}^{N} X_{k,j}\right)$$

where X_i,j is the j-th descriptor value of the i-th amino acid in the sequence, N is the length of the protein sequence, and d is the width of the sliding window. In this article, the maximum value of d is set to 30 and the number of descriptors j is 6. On this basis, a protein sequence is finally encoded as a 30 × 6 = 180-dimensional feature vector.
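A compact sketch of the two steps (descriptor standardization, then auto covariance encoding) is given below. It follows the commonly used AC formulation and makes several assumptions: the population standard deviation is used for normalization, the window width is treated as a lag running from 1 to `max_lag`, and the function names and the toy descriptor table in the usage note are illustrative only (the real values are those of Table 6).

```python
import math


def standardize(table):
    """Z-score each descriptor across amino acids (population SD assumed).

    `table` maps an amino acid letter to its list of descriptor values.
    """
    aas = list(table)
    n_desc = len(table[aas[0]])
    means = [sum(table[a][j] for a in aas) / len(aas) for j in range(n_desc)]
    sds = [
        math.sqrt(sum((table[a][j] - means[j]) ** 2 for a in aas) / len(aas))
        for j in range(n_desc)
    ]
    return {
        a: [(table[a][j] - means[j]) / sds[j] for j in range(n_desc)]
        for a in aas
    }


def auto_covariance(sequence, table, max_lag=30):
    """Encode a sequence as max_lag * n_descriptors AC values."""
    X = [table[a] for a in sequence]  # one descriptor row per residue
    n = len(X)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(n_desc)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            s = sum(
                (X[i][j] - col_mean[j]) * (X[i + lag][j] - col_mean[j])
                for i in range(n - lag)
            )
            features.append(s / (n - lag))
    return features
```

With six descriptors and `max_lag=30`, `auto_covariance` returns a 180-dimensional vector, matching the dimensionality stated above. A toy call such as `auto_covariance("ACG" * 4, standardize({"A": [1.0, 2.0], "C": [2.0, 0.0], "G": [4.0, 1.0]}), max_lag=3)` returns 3 × 2 = 6 values.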

The multi-source molecular network construction
In order to utilize the associated information of proteins with other biomolecules, we systematically and comprehensively constructed the association information network by integrating the known associations among proteins, diseases, miRNAs, drugs, and lncRNAs, which were downloaded from multiple databases. The source and version of the raw data are shown in Table 7. In addition, we preprocessed the raw data, for example by removing irrelevant items and unifying the identifiers. We also counted the number of nodes contained in the original association data, as shown in Table 8.
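Integrating several association lists into one network can be sketched as follows. The data layout (a dictionary keyed by a pair of molecule types), the typed node identifiers, and the function name `build_network` are assumptions for illustration; the actual file formats of the source databases are not described here.

```python
from collections import defaultdict


def build_network(association_lists):
    """Merge typed association lists into one undirected adjacency map.

    `association_lists` maps (type_a, type_b) to an iterable of (id_a, id_b).
    Nodes are stored as (type, id) tuples so that identifiers from different
    molecule types cannot collide (an assumed convention).
    """
    adjacency = defaultdict(set)
    for (type_a, type_b), pairs in association_lists.items():
        for id_a, id_b in pairs:
            u, v = (type_a, id_a), (type_b, id_b)
            if u == v:
                continue  # skip self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)
```

Storing neighbors in sets also removes duplicate associations that appear in more than one source.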

DeepWalk algorithms
In order to extract the association features of proteins from the association information network we constructed, the graph embedding algorithm DeepWalk 29 was adopted in our work. The input of the DeepWalk method is a graph or network, and the social representation of the vertices in the network is learned through truncated random walks and the SkipGram model. Finally, it outputs the latent representation of the vertices in the network. The basic idea of this algorithm is first to obtain node sequences, treated as sentences, through random walks, and then to learn the latent representation of each node from the local information in these truncated random walk sequences by maximizing the co-occurrence probability of the vertices within a window of size w around vertex v_j, which is calculated as follows: where Φ(v_j) indicates that vertex v_j is mapped to its representation space, ϕ(b_k) means the parent node of the tree node b_k. More specifically, the entire DeepWalk method is mainly composed of two algorithms. Algorithm 1 of the DeepWalk model mainly includes 4 steps: (1) Generate γ random walks for each node in the input network structure. (2) Uniformly sample a node in the network as the root node of each random walk. (3) Uniformly select a neighbor of the current node as the next node. (4) Repeat the above steps until the walking path reaches the specified length. Algorithm 2 of the DeepWalk model applies the SkipGram model to the sequence data to obtain the network feature vector of each node. The SkipGram model iterates over all possible collocations within a window of the random walk sequence. It uses each node to predict its context and learns the vector representation by maximizing the co-occurrence probability of the nodes in a window while neglecting the order in which the nodes occur in the sentence. According to the independence assumption, the co-occurrence probability can be factorized into a product of conditional probabilities. The detailed processes of the two algorithms are shown in Tables 9 and 10, respectively. In this way, the associated information of proteins with other biomolecules in the association information network is converted to a feature vector, which can be used by machine learning classifiers.
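The walk-generation part of DeepWalk and the window pairing used by SkipGram can be illustrated with a short sketch. It covers only corpus generation; training the embeddings themselves is omitted. The function names, the adjacency format (node mapped to a list of neighbors), and the per-pass shuffling of root nodes are assumptions.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Truncated random walk: pick a uniform random neighbor at each step."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency.get(walk[-1])
        if not neighbors:
            break  # isolated node: the walk ends early
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Generate `walks_per_node` walks rooted at every node of the network."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit root nodes in random order on each pass
        for node in nodes:
            walks.append(random_walk(adjacency, node, walk_length, rng))
    return walks


def skipgram_pairs(walk, window):
    """List (center, context) pairs within `window` positions of each node."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs
```

The resulting walks play the role of sentences; they could be passed to any SkipGram implementation (for example a word2vec library) to obtain one embedding vector per node.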

The representation of protein nodes
In this study, the protein nodes were represented by the combination of the physicochemical features of protein sequences and the multi-source association information of proteins with other biomolecules (drugs, miRNAs, lncRNAs, and diseases) in the association information network. The sequence features of proteins were obtained by the auto covariance (AC) method based on the six physicochemical properties of amino acids. The association information of proteins with other nodes was obtained by the network representation method DeepWalk based on the association information network we constructed. Finally, we combined these two kinds of features to represent the protein-protein interaction pairs.
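The combination step can be sketched as a plain concatenation. The ordering of the blocks, the function name `pair_features`, and the dictionary-based feature lookup are assumptions; the text only states that the sequence and network features are combined to represent each pair.

```python
def pair_features(protein_a, protein_b, sequence_features, network_features):
    """Concatenate sequence (AC) and network (DeepWalk) features of a pair.

    Both lookups map a protein identifier to a list of floats. The block
    order [seq_a, net_a, seq_b, net_b] is an assumed convention.
    """
    return (
        list(sequence_features[protein_a])
        + list(network_features[protein_a])
        + list(sequence_features[protein_b])
        + list(network_features[protein_b])
    )
```

With 180-dimensional AC features and a d-dimensional embedding per protein, each pair is represented by a vector of length 2 × (180 + d).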
Table 7. The data information in the multi-source molecular network.

Conclusion
Protein-protein interactions (PPIs) play a vital role in the cell biochemical reaction network and are significant for regulating cells and their signals. However, traditional biological experiment methods are time-consuming and have long experimental cycles, which makes them unsuitable for large-scale protein-protein interaction prediction. In this study, we proposed a novel computational method to predict potential PPIs by combining the sequence features of proteins and their association information with other molecules. For the sequence features of proteins, we utilized the auto covariance (AC) method to extract them based on the six physicochemical properties of amino acids. For the association information features of proteins with other molecules, we utilized the DeepWalk network representation method to extract them based on the association information network we constructed. In this way, the proteins were represented by combining these two kinds of features. Finally, the Random Forest classifier and its corresponding optimal parameters were selected for training and prediction. As a result, our proposed method achieved an average accuracy of 86.03% and an AUC of 93.03% under fivefold cross-validation, which is superior to many existing computational models. Besides, to evaluate the effect of our feature combination, we further compared the performance of using only the protein sequence features with that of combining the protein sequence and association features. Furthermore, to select the most suitable classifier for our model, we also compared the ability of the four most commonly used classifiers. While overcoming many challenges, our current method still has its limitations. In our work, we collected eight types of associations among five types of biological molecules to construct a multi-source molecular network. All the proteins in our dataset are distributed on this network. Therefore, we are able to utilize the relationships between different molecules to extract the network features of protein nodes. Note that we have removed known protein-protein interactions during training to avoid label leakage. An independent test set, completely independent of the existing dataset, would make it impossible to use the molecular network relationships. We designed our model to address this limitation by considering both

Figure 2. The flowchart of our proposed model.

Figure 4. The PR curves and AUPR values of our model under fivefold cross-validation.

Figure 5. The ROC curves and AUC values of two different feature combination strategies. (A) The ROC curves and AUC values of only the protein sequence features. (B) The ROC curves and AUC values of the combination of protein sequence features and the multi-source associated information of proteins. (C) Comparison of the ROC curves and AUC values of the two feature combination strategies.

Figure 6. The PR curves and AUPR values of two different feature combination strategies. (A) The PR curves and AUPR values of only the protein sequence features. (B) The PR curves and AUPR values of the combination of protein sequence features and the multi-source associated information of proteins. (C) Comparison of the PR curves and AUPR values of the two feature combination strategies.

Figure 7. The ROC curves and AUC values of different classifiers. (A) The ROC curves and AUC values of the Decision Tree classifier. (B) The ROC curves and AUC values of the KNN classifier. (C) The ROC curves and AUC values of the Naive Bayes classifier. (D) The ROC curves and AUC values of the Random Forest classifier. (E) Comparison of the ROC curves and AUC values of different classifiers.

Table 1. The fivefold cross-validation results of our proposed model.
Figure 3. The ROC curves and AUC values of our model under fivefold cross-validation.

Table 2. The results of different feature combinations under fivefold cross-validation.

Table 3 lists the average parameter values of different classifiers under fivefold cross-validation. Figures 7 and 8 show the ROC and PR curves of the comparative experiment, respectively. The comparison results showed that the Random Forest is more suitable for our model than the other classifiers, especially in terms of the AUC and accuracy, which reflect the overall ability of a model.
Table 4 lists the accuracy results of the Random Forest classifier with different N parameters under fivefold cross-validation. The Random Forest classifier has the best performance when the number of trees is 300.

Table 3. The average parameter values of different classifiers under fivefold cross-validation.

Table 4. The accuracy results of the Random Forest classifier with different N parameters.

Table 5. Performance comparison of MultiPPIs with the state-of-the-art methods.

Table 6. The original physicochemical values of 20 amino acids.

Table 8. The node information in the multi-source molecular network.

Table 9. Algorithm 1 of the DeepWalk model.

Table 10. Algorithm 2 of the DeepWalk model.